Happiness Dataset

The World Happiness Report is a landmark survey of the state of global happiness. The happiness scores and rankings use data from the Gallup World Poll (GWP). The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale.

Further, the Happiness Report includes six additional factors (levels of GDP, life expectancy, generosity, social support, freedom, and corruption). For each country, these values show the estimated extent to which each factor contributes to making life evaluations (the happiness score) higher than in Dystopia. The underlying raw data points for those estimations come from other organisations (e.g. the WHO) or from Gallup World Poll question results. Dystopia, in this context, is a hypothetical country with values equal to the world’s lowest national averages for each of the six factors’ raw values. The purpose of establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables. Since life would be very unpleasant in a country with the world’s lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia”, in contrast to Utopia.

Thus, each of the six factor values gives the contribution of that factor to the happiness score of a country being higher than in Dystopia. That is why the happiness score can be calculated as: \[\text{happiness} = \sum_{i=1}^{6} \text{factorvalue}_i + \text{dystopiahappiness} + \text{residual}\]
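As a quick check of this decomposition, the following sketch adds up the parts for the three 2015 example rows that appear in the table in the "Over Time Dataset" part of this report. It assumes that the column Dystopia.Residual already holds the sum of the Dystopia happiness and the residual; the data frame and column names are shortened for illustration.

```r
# Values copied from the 2015 example rows of this report.
# Assumption: DystopiaResidual = Dystopia happiness + residual.
whr_2015 <- data.frame(
  Country    = c("Switzerland", "Iceland", "Denmark"),
  Score      = c(7.587, 7.561, 7.527),
  Economy    = c(1.39651, 1.30232, 1.32548),
  Family     = c(1.34951, 1.40223, 1.36058),
  Health     = c(0.94143, 0.94784, 0.87464),
  Freedom    = c(0.66557, 0.62877, 0.64938),
  Trust      = c(0.41978, 0.14145, 0.48357),
  Generosity = c(0.29678, 0.43630, 0.34139),
  DystopiaResidual = c(2.51738, 2.70201, 2.49204)
)

parts <- c("Economy", "Family", "Health", "Freedom", "Trust",
           "Generosity", "DystopiaResidual")
whr_2015$Reconstructed <- rowSums(whr_2015[, parts])

# The reconstructed sum agrees with the published score up to rounding.
print(whr_2015[, c("Country", "Score", "Reconstructed")])
```

Because the parts add up to the score by construction, a regression of the score on these columns would only recover this identity.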

This makes it clear that the six factors are already the result of an estimation and therefore cannot be used to analyse variable importance. Regression coefficients, for example, would not be helpful at all: because the residual is included in the dataset, the intercept would be 0 and all coefficients would equal 1.

That is why we looked for an additional version of the happiness dataset that includes the actual raw values, which we can use for analysing variable importance and in the dimension reduction steps.

Report questions based on the Happiness Datasets

Based on the happiness dataset we want to answer the following leading questions.

What influences Happiness?

Can happiness be explained by certain factors? What are those factors and how much do they influence happiness? For these questions we need the raw values to build our analysis on. To answer them, we decided to add further factors which might explain the different happiness levels. We were interested in how drug abuse correlates with happiness and found suitable datasets for alcohol consumption and tobacco consumption. Additionally, we were interested in how the modern use of social media influences happiness. However, we only found a fitting Internet dataset, which captures the percentage of individuals in a country who use the Internet.

Happiness over time?

For the change of happiness we can use the plain happiness dataset, as it captures the happiness scores and the “explained by” parts for the six factors over time. We can therefore calculate and visualize the changes over time.

Datasets and Pre-Processing

To answer our two main questions we decided, for the reasons given above, to create two datasets.

Over Time Dataset

To analyse the change of happiness over time we used the data from the World Happiness Report, which ranges from 2015 to 2022. An example from 2015 can be seen in the table below.
| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |

To get our final dataset we had to do some preprocessing, which included joining the individual year files, renaming the columns and cleaning the data (region, NaN). The dataset has 1185 rows and 11 columns. The resulting over time dataset can be seen below.
| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |

Influential Factors Dataset

To answer the question “What influences happiness?” we had to use the raw data of the factors and not their “explained by” values. In addition, we wanted to add further factors and added three datasets covering alcohol consumption, tobacco use and Internet use.

By merging the datasets we now have four additional factors (Alcohol, Population, Tobacco and Internet).

To join all the different datasets we had to do some preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN) and joining the datasets on the year and the country code.

After joining we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to use only one year for analysing the influential factors.

missing values full data

We inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). Figure “missing values 2017” shows, for example, that the smoking and alcohol datasets did not contain any values for 2017. We then excluded all remaining rows containing missing values and renamed the columns to get shorter labels.
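The join and the year selection can be sketched in base R as follows. The two small data frames are made-up stand-ins, and the object names (`happy`, `alcohol`, `full`, `final`) are hypothetical, so this shows the general pattern under those assumptions rather than the exact preprocessing code of this report.

```r
# Toy stand-ins for the happiness data and one additional dataset.
happy <- data.frame(
  Code = c("AAA", "BBB", "AAA", "BBB"),
  Year = c(2017, 2017, 2018, 2018),
  Happiness = c(5.1, 6.3, 5.2, 6.4)
)
alcohol <- data.frame(
  Code = c("AAA", "BBB"),
  Year = c(2018, 2018),
  Alcohol = c(7.2, 9.6)
)

# Left join on country code and year; unmatched rows get NA.
full <- merge(happy, alcohol, by = c("Code", "Year"), all.x = TRUE)

# Count missing cells per year and pick the year with the fewest.
na_per_year <- tapply(rowSums(is.na(full)), full$Year, sum)
best_year <- as.numeric(names(which.min(na_per_year)))

# Keep only that year and drop any rows that still contain NA.
final <- full[full$Year == best_year, ]
final <- final[complete.cases(final), ]
print(na_per_year)
print(final)
```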

The final influential factors dataset consists of 96 rows (countries for the year 2018) and 18 columns, which are briefly explained below. A more detailed explanation can be found in the Statistical Appendix of the World Happiness Report.

  • Country
  • Year
  • Happiness: happiness score
  • Economy: Log GDP per capita
  • Social: (support) national average of the binary responses (either 0 or 1) to the GWP question
  • Health: Healthy life expectancy at birth from WHO
  • Freedom: Freedom to make life choices, national average of responses to the GWP question
  • Generosity: residual of regressing the national average of responses to the GWP question on GDP per capita
  • Corruption: national average of the survey responses (either 0 or 1) to two questions in the GWP
  • Positive: (affect) average of three positive affect measures in the GWP: happiness, laugh and enjoyment
  • Negative: (affect) average of three negative affect measures in the GWP: worry, sadness and anger
  • Government: Confidence in national government
  • Code: Country code
  • Alcohol: Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
  • Population: Population (historical estimates)
  • Tobacco: Prevalence of current tobacco use (% of adults)
  • Internet: Individuals using the Internet (% of population)
| Country | Region | Year | Happiness | Economy | Social | Health | Freedom | Generosity | Corruption | Positive | Negative | Government | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | Central and Eastern Europe | 2018 | 5.004403 | 9.412399 | 0.6835917 | 68.7 | 0.8242123 | 0.0053850 | 0.8991294 | 0.7132996 | 0.3189967 | 0.4353380 | ALB | 7.17 | 2882735 | 29.2 | 65.40000 |
| Argentina | Latin America and Caribbean | 2018 | 5.792797 | 9.809972 | 0.8999116 | 68.8 | 0.8458947 | -0.2069366 | 0.8552552 | 0.8203097 | 0.3205021 | 0.2613523 | ARG | 9.65 | 44361150 | 21.8 | 77.70000 |
| Armenia | Commonwealth of Independent States | 2018 | 5.062449 | 9.119424 | 0.8144490 | 66.9 | 0.8076437 | -0.1491087 | 0.6768264 | 0.5814877 | 0.4548403 | 0.6708276 | ARM | 5.55 | 2951741 | 26.7 | 68.24505 |

missing values 2017

missing values 2018

Preliminary Analyses

One of the objectives of preliminary data analysis is to get a feel for the data by describing its key features and summarizing the results. We focus on the second dataset, the influential factors dataset, which includes the raw values and not the “explained by” values.

Boxplots and Data Scaling

First we check via the summary how all the explanatory variables are distributed. As we can see, they are on different scales, especially “Health”, “Population” and “Internet”. As we don’t want the data reduction analysis to be driven by the variables with the largest distances, we scale them by \(\frac{x - \text{mean}(x)}{\text{sd}(x)}\)

##    Happiness        Economy           Social           Health     
##  Min.   :3.335   Min.   : 6.630   Min.   :0.5035   Min.   :48.20  
##  1st Qu.:4.702   1st Qu.: 8.570   1st Qu.:0.7396   1st Qu.:59.85  
##  Median :5.536   Median : 9.669   Median :0.8581   Median :66.80  
##  Mean   :5.597   Mean   : 9.394   Mean   :0.8220   Mean   :65.23  
##  3rd Qu.:6.340   3rd Qu.:10.346   3rd Qu.:0.9130   3rd Qu.:71.20  
##  Max.   :7.858   Max.   :11.454   Max.   :0.9660   Max.   :75.00  
##     Freedom         Corruption       Generosity          Positive     
##  Min.   :0.5286   Min.   :0.1506   Min.   :-0.33638   Min.   :0.4347  
##  1st Qu.:0.7245   1st Qu.:0.6849   1st Qu.:-0.14312   1st Qu.:0.6427  
##  Median :0.8084   Median :0.7989   Median :-0.02550   Median :0.7353  
##  Mean   :0.7945   Mean   :0.7255   Mean   :-0.01767   Mean   :0.7114  
##  3rd Qu.:0.8784   3rd Qu.:0.8559   3rd Qu.: 0.07377   3rd Qu.:0.8000  
##  Max.   :0.9699   Max.   :0.9520   Max.   : 0.49938   Max.   :0.8836  
##     Negative        Government         Alcohol         Population       
##  Min.   :0.1580   Min.   :0.07971   Min.   : 0.019   Min.   :6.042e+05  
##  1st Qu.:0.2132   1st Qu.:0.33120   1st Qu.: 4.280   1st Qu.:6.028e+06  
##  Median :0.2749   Median :0.50385   Median : 7.410   Median :1.585e+07  
##  Mean   :0.2845   Mean   :0.50944   Mean   : 7.221   Mean   :5.380e+07  
##  3rd Qu.:0.3509   3rd Qu.:0.64084   3rd Qu.:10.570   3rd Qu.:5.042e+07  
##  Max.   :0.5438   Max.   :0.98812   Max.   :15.090   Max.   :1.353e+09  
##     Tobacco         Internet    
##  Min.   : 4.60   Min.   : 8.00  
##  1st Qu.:13.90   1st Qu.:30.80  
##  Median :22.80   Median :68.25  
##  Mean   :22.21   Mean   :59.34  
##  3rd Qu.:27.95   3rd Qu.:81.62  
##  Max.   :45.50   Max.   :97.32

We can see that every factor is now on the same scale. We have some outliers for Corruption, Generosity and Population.
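The scaling step can be illustrated with a minimal sketch. The values below are made up and only mimic variables on very different scales; `df`, `z_manual` and `z_scale` are hypothetical names.

```r
# Toy columns on very different scales (values are made up).
df <- data.frame(
  Health     = c(48.2, 59.9, 66.8, 71.2, 75.0),
  Population = c(6.0e5, 6.0e6, 1.6e7, 5.0e7, 1.4e9),
  Internet   = c(8.0, 30.8, 68.3, 81.6, 97.3)
)

# Manual z-score: subtract the column mean, divide by the column sd.
z_manual <- as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same centring and scaling in one call.
z_scale <- as.data.frame(scale(df))

print(round(colMeans(z_manual), 10))  # all columns are centred
print(apply(z_manual, 2, sd))         # all columns have unit sd
```

After this transformation each column has mean 0 and standard deviation 1, so no single variable dominates distance-based methods merely because of its unit.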

Correlation Matrix

In the correlation matrix plot we see that happiness has the strongest correlations with Economy (0.801), Internet (0.786), Social (0.768) and Health (0.767). Among the correlations between the explanatory variables, the following stand out:

  • 0.938: Economy and Internet
  • 0.878: Health and Internet
  • 0.875: Economy and Health
  • 0.807: Economy and Social
  • 0.791: Social and Internet
  • 0.753: Social and Health
  • 0.659: Freedom and Positive
  • -0.646: Social and Negative
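A ranked list like the one above can be derived from a correlation matrix as sketched below. The built-in `mtcars` data serves only as a stand-in (with `mpg` playing the role of the outcome), since the happiness data frame itself is not reproduced here; `pairs_df` is a hypothetical name.

```r
# Built-in mtcars is used only as a stand-in for the happiness data.
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
corr <- cor(vars)

# Correlations of the outcome (here: mpg) with every other variable.
print(round(sort(corr["mpg", -1], decreasing = TRUE), 3))

# List each pair of explanatory variables once (upper triangle only).
expl <- corr[-1, -1]
idx <- which(upper.tri(expl), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(expl)[idx[, "row"]],
  var2 = colnames(expl)[idx[, "col"]],
  r    = expl[idx]
)

# Strongest relationships first, regardless of sign.
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df, row.names = FALSE)
```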

What influences happiness?

In this chapter we try to answer the question “What influences happiness?” by several methods of influential factors analysis.

Regression

One tool for getting a first impression of what influences happiness is linear regression. For the regression we use the unscaled data. If our linear model has good predictive power, we can interpret the coefficients in terms of how they influence the outcome. This is also called regression analysis, where the goal is to isolate the relationship between each explanatory variable and the outcome variable.

However, this interpretation assumes that we can change the value of one explanatory variable while the others stay fixed. This is only true if there are no correlations between the explanatory variables. If this independence does not hold, we have a problem of multicollinearity. This can result in the coefficients swinging wildly depending on which other independent variables are in the model. The coefficients then become very sensitive to small changes in the model and cannot be easily interpreted.

One way to assess how strongly the explanatory variables are affected by multicollinearity is the variance inflation factor (VIF). VIFs identify correlations among the explanatory variables and their strength. VIFs between 1 and 5 suggest a low to moderate correlation; VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated.
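The VIF of a predictor j is 1 / (1 - R²_j), where R²_j is the R-squared obtained when regressing predictor j on all other predictors. The following base-R sketch computes it on simulated data; `vif_manual` is a hypothetical helper written for illustration and not necessarily the function used for the plots in this report.

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all remaining predictors.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    fit <- lm(reformulate(others, response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated.
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)
x3 <- rnorm(200)
vifs <- vif_manual(data.frame(x1, x2, x3))
print(round(vifs, 2))
```

In this simulation the two nearly identical predictors receive large VIFs, while the unrelated predictor stays close to 1.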

If we build a linear regression model on all explanatory variables, we get an R-squared of 0.8063. However, by plotting the VIF values we can see that a model based on all explanatory variables has severe multicollinearity. Therefore we cannot interpret the coefficients for Internet, Health and Economy.

## 
## Call:
## lm(formula = Happiness ~ ., data = not_scaled_data_factors)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60190 -0.24719  0.00124  0.28565  1.79684 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.527e+00  1.487e+00  -1.027   0.3076   
## Economy      3.317e-01  1.628e-01   2.037   0.0449 * 
## Social       3.251e+00  9.952e-01   3.266   0.0016 **
## Health       7.641e-03  1.971e-02   0.388   0.6993   
## Freedom      1.404e+00  8.833e-01   1.589   0.1159   
## Corruption  -1.247e+00  4.577e-01  -2.724   0.0079 **
## Generosity   7.633e-01  4.282e-01   1.783   0.0784 . 
## Positive     6.045e-01  7.901e-01   0.765   0.4465   
## Negative     2.332e+00  9.192e-01   2.537   0.0131 * 
## Government  -9.855e-01  4.520e-01  -2.180   0.0321 * 
## Alcohol     -3.825e-03  1.898e-02  -0.202   0.8407   
## Population  -3.861e-10  4.241e-10  -0.910   0.3654   
## Tobacco     -1.165e-02  7.295e-03  -1.596   0.1143   
## Internet     6.001e-03  7.110e-03   0.844   0.4012   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.541 on 81 degrees of freedom
## Multiple R-squared:  0.8063, Adjusted R-squared:  0.7752 
## F-statistic: 25.94 on 13 and 81 DF,  p-value: < 2.2e-16

If we build a linear regression model without Internet and Economy, we get an R-squared of 0.7745. This R-squared is lower than before, but after plotting the VIF values we can see that we are allowed to interpret the coefficients for the remaining explanatory variables, as all VIF values are below 5.

Interestingly, only Social, Health, Corruption, Negative and Government are statistically significant:

  • Social has a positive effect on the happiness score. A one-unit increase in Social results in an absolute happiness increase of 4.646
  • Health has a positive effect on the happiness score. A one-unit increase in Health results in an absolute happiness increase of 0.05214
  • Corruption has a negative effect on the happiness score. A one-unit increase in Corruption results in an absolute happiness decrease of 1.616
  • Negative has a positive effect on the happiness score. A one-unit increase in Negative results in an absolute happiness increase of 1.927
  • Government has a negative effect on the happiness score. A one-unit increase in Government results in an absolute happiness decrease of 1.016
## 
## Call:
## lm(formula = Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63299 -0.30363 -0.02198  0.34810  2.08143 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.609e+00  1.357e+00  -1.186 0.238858    
## Social       4.646e+00  9.788e-01   4.746 8.54e-06 ***
## Health       5.214e-02  1.626e-02   3.207 0.001908 ** 
## Freedom      8.769e-01  9.265e-01   0.946 0.346660    
## Corruption  -1.616e+00  4.667e-01  -3.463 0.000847 ***
## Generosity   4.041e-01  4.430e-01   0.912 0.364406    
## Positive     8.449e-01  8.287e-01   1.020 0.310893    
## Negative     1.927e+00  9.682e-01   1.990 0.049879 *  
## Government  -1.016e+00  4.768e-01  -2.132 0.035974 *  
## Alcohol      4.612e-03  2.003e-02   0.230 0.818514    
## Population  -1.135e-10  4.242e-10  -0.268 0.789649    
## Tobacco     -8.494e-03  7.700e-03  -1.103 0.273137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5766 on 83 degrees of freedom
## Multiple R-squared:  0.7745, Adjusted R-squared:  0.7446 
## F-statistic: 25.92 on 11 and 83 DF,  p-value: < 2.2e-16

Next we tried a linear regression method with shrinkage. In lasso regression some estimates can become exactly zero. The result is therefore a type of variable selection, which makes the model sparse and easier to interpret. For lasso regression all predictor variables should be scaled so that they have the same standard deviation. Otherwise, the predictor variables are weighted unequally in the penalty term. The glmnet() function, however, standardizes the predictors by default, and the output coefficients are transformed back to the original scale.

## [1] "Lasso Regression"
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  0.30404144
## Social       3.22444813
## Health       0.04668353
## Freedom      .         
## Corruption  -0.56261151
## Generosity   .         
## Positive     0.00795680
## Negative     .         
## Government   .         
## Alcohol      .         
## Population   .         
## Tobacco      .

The results of the lasso regression confirm our results from the ordinary regression for Social, Health and Corruption. However, Positive is added, and Negative and Government are removed from the model.

  • Social.support: 3.35096335
  • Health: 0.04886770
  • Corruption: -0.68012247
  • Positive.affect: 0.22545789
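Why the lasso can set coefficients to exactly zero is easiest to see in the soft-thresholding step of a coordinate descent solver. The sketch below is a didactic base-R implementation on simulated data, not the glmnet() code used for the results above; `soft` and `lasso_cd` are hypothetical helper names, and the penalty values are arbitrary.

```r
# Soft-thresholding operator: shrinks z towards 0 and returns exactly 0
# whenever |z| <= lambda.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Lasso for standardised predictors via cyclic coordinate descent.
# Minimises (1 / (2n)) * sum((y - X b)^2) + lambda * sum(|b|).
lasso_cd <- function(X, y, lambda, iters = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    for (j in seq_len(ncol(X))) {
      r_j <- y - X[, -j, drop = FALSE] %*% b[-j]   # partial residual
      b[j] <- soft(sum(X[, j] * r_j) / n, lambda) / (sum(X[, j]^2) / n)
    }
  }
  setNames(b, colnames(X))
}

set.seed(42)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3,
                  dimnames = list(NULL, c("signal1", "signal2", "noise"))))
y <- 2 * X[, "signal1"] - 1 * X[, "signal2"] + rnorm(n)
y <- y - mean(y)

# Smallest penalty (up to rounding) for which the all-zero fit is optimal.
lambda_max <- max(abs(crossprod(X, y))) / n

print(lasso_cd(X, y, lambda = 0))                  # no penalty: least squares
print(lasso_cd(X, y, lambda = 0.3))                # moderate penalty
print(lasso_cd(X, y, lambda = 1.01 * lambda_max))  # all coefficients exactly 0
```

With a penalty of zero the solver reproduces ordinary least squares; as the penalty grows, coefficients are shrunk and eventually dropped from the model.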

PCA and Biplot

Principal component analysis starts from the assumption that, for strongly correlated quantities, there is a third quantity which cannot be measured directly, which lies behind these correlated variables and, so to speak, expresses itself in them. This means that the measurable quantities are only another manifestation of quantities that stand in the background and cannot be measured directly. These background quantities are called principal components, latent variables or factors. The goal of principal component analysis is to determine such background quantities or factors from the measured data and to explain the observed relationships as completely as possible. Principal component analysis can therefore condense complex information into just a few orthogonal pieces of information.

Principal component analysis determines the factors on purely mathematical grounds. Since the first factor always points in the direction of maximum variance in the data, it represents the actually measured information best.

Using a sample of six hundred participants, a linear regression model was fitted and collinearity between predictors was detected using the Variance Inflation Factor (VIF). After confirming strong relationships between the independent variables, principal component analysis was used to find linear combinations of the variables that capture a large share of the variance without much loss of information. Thus, the set of correlated variables was reduced to a smaller number of new variables which are independent of each other but are linear combinations of the related variables. To check for remaining relationships between predictors, the dependent variable was regressed on these five principal components. The results show that the VIF values for each predictor ranged from 1 to 3, which indicates that the multicollinearity problem was eliminated.

For the PCA we are using the scaled factors without the happiness score. The first two PCs explain 59.01 % of the variation together.
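How the explained variance and the coefficients of a PCA are obtained can be sketched with prcomp(). The built-in `USArrests` data is used only as a stand-in for the scaled factors, so the numbers differ from those reported here. The sketch also shows that, for standardised variables, the correlation between a variable and a component equals the loading multiplied by the standard deviation of that component.

```r
# Built-in USArrests serves only as a stand-in for the scaled factors.
X <- scale(USArrests)
pca <- prcomp(X)   # data are already centred and scaled

# Share of the total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_explained, 2))

# Loadings (coefficients of the linear combinations) for PC1 and PC2.
print(round(pca$rotation[, 1:2], 3))

# For standardised variables, the correlation between a variable and a
# component equals its loading times the component's standard deviation.
cor_from_loadings <- sweep(pca$rotation, 2, pca$sdev, "*")
cor_direct <- cor(X, pca$x)
print(round(cor_from_loadings[, 1:2], 3))
```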

PC1 explains 39.07 % of the variation and the coefficients are the following:

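\[\begin{aligned} PC1 = {} & -0.415 \cdot Economy - 0.397 \cdot Social - 0.395 \cdot Health - 0.174 \cdot Freedom + 0.192 \cdot Corruption \\ & + 0.115 \cdot Generosity - 0.182 \cdot Positive + 0.317 \cdot Negative + 0.132 \cdot Government - 0.289 \cdot Alcohol \\ & + 0.069 \cdot Population - 0.164 \cdot Tobacco - 0.411 \cdot Internet \end{aligned}\]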

The first PCA plot, colored by the rounded happiness scores, clusters the countries quite well. For low values on PC1 and PC2 we find the really high happiness scores. The top 3 countries for 2018 (Finland, Denmark and Switzerland) are all in that region. It is also interesting that most of the countries in the lower left are from ‘Western Europe’, except for ‘New Zealand’, ‘Australia’ and ‘Canada’, which are from ‘North America and ANZ’. When we move from left to right, the happiness scores decrease. The values 8, 7, 6, 5 and 4 are separated quite well. An exception is happiness category 3, whose countries are spread out over the right half of the plot.

An interesting outlier is Benin (BEN) on the middle right. Benin belongs to happiness category 6 but lies on the very right side. Another outlier is Botswana (BWA), which belongs to happiness category 3 but lies in the very middle.

With the coefficients and the

coefficients

PC2 explains 19.94% of the variation and the coefficients are the following:

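\[\begin{aligned} PC2 = {} & 0.059 \cdot Economy + 0.014 \cdot Social + 0.054 \cdot Health - 0.478 \cdot Freedom + 0.388 \cdot Corruption \\ & - 0.396 \cdot Generosity - 0.384 \cdot Positive + 0.108 \cdot Negative - 0.467 \cdot Government + 0.054 \cdot Alcohol \\ & - 0.078 \cdot Population + 0.246 \cdot Tobacco + 0.103 \cdot Internet \end{aligned}\]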

##                    PC1         PC2
## Economy    -0.41453243  0.05900996
## Social     -0.39693364  0.01390851
## Health     -0.39520812  0.05439823
## Freedom    -0.17412078 -0.47755922
## Corruption  0.19210827  0.38831406
## Generosity  0.11453564 -0.39633395
## Positive   -0.18164843 -0.38383237
## Negative    0.31674709  0.10827902
## Government  0.13158791 -0.46723155
## Alcohol    -0.28868291  0.05382308
## Population  0.06882637 -0.07785665
## Tobacco    -0.16380250  0.24605983
## Internet   -0.41052241  0.10282514

##        factor  PC coefficient correlation
## 1     Economy PC1 -0.41453243 -0.93422477
## 2      Social PC1 -0.39693364 -0.89456266
## 3      Health PC1 -0.39520812 -0.89067389
## 4     Freedom PC1 -0.17412078 -0.39241307
## 5  Corruption PC1  0.19210827  0.43295117
## 6  Generosity PC1  0.11453564  0.25812705
## 7    Positive PC1 -0.18164843 -0.40937802
## 8    Negative PC1  0.31674709  0.71384759
## 9  Government PC1  0.13158791  0.29655745
## 10    Alcohol PC1 -0.28868291 -0.65059982
## 11 Population PC1  0.06882637  0.15511284
## 12    Tobacco PC1 -0.16380250 -0.36915894
## 13   Internet PC1 -0.41052241 -0.92518744
## 14    Economy PC2  0.05900996  0.09500783
## 15     Social PC2  0.01390851  0.02239312
## 16     Health PC2  0.05439823  0.08758282
## 17    Freedom PC2 -0.47755922 -0.76888490
## 18 Corruption PC2  0.38831406  0.62519748
## 19 Generosity PC2 -0.39633395 -0.63810974
## 20   Positive PC2 -0.38383237 -0.61798182
## 21   Negative PC2  0.10827902  0.17433252
## 22 Government PC2 -0.46723155 -0.75225703
## 23    Alcohol PC2  0.05382308  0.08665679
## 24 Population PC2 -0.07785665 -0.12535158
## 25    Tobacco PC2  0.24605983  0.39616383
## 26   Internet PC2  0.10282514  0.16555160

# Scores on the first two principal components
ggdat <- data.frame(X = pca$x[, 1], Y = pca$x[, 2])
# Group the countries by their rounded happiness score
ggdat$group_id <- as.factor(round(not_scaled_data_factors$Happiness))

# Scatter plot of the scores with one normal ellipse per group
ggplot(ggdat) +
  geom_point(aes(x = X, y = Y, color = group_id), size = 1) +
  stat_ellipse(aes(x = X, y = Y, color = group_id, group = group_id), type = "norm") +
  theme(legend.position = 'none')
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Warning: Removed 2 row(s) containing missing values (geom_path).

library("FactoMineR")
library("factoextra")
res.pca <- PCA(scaled_data_factors[,correlation_categories_without_happy], graph = FALSE)
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

#corrplot(var$cos2, is.corr=FALSE)
fviz_pca_ind(res.pca,
             geom.ind = "point", # show points only (but not "text")
             col.ind = as.factor(happiness_category), # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07","#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups"
             )
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse

PLS

The PLS method computes a regression of many independent x-variables on one or more y-variables. The difference from multiple linear regression is that the x-variables may be highly correlated and intercorrelated. There may also be many more x-variables than objects, and the regression can still be computed.

As in PCA, PLS regression decomposes the x-variables into the matrices S and F. However, in this decomposition into the principal components of x, the target variable y is already taken into account.
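The difference between the two decompositions can be made concrete for the first component. In the sketch below, which uses the built-in `mtcars` data as a stand-in (`mpg` plays the role of the target), the first PCA direction depends on the x-variables only, whereas the first PLS direction for a single target is proportional to X'y. The variable names are hypothetical, and this is only the first step of a PLS fit, not a full implementation of plsr().

```r
# Built-in mtcars is a stand-in; mpg plays the role of the target.
X <- scale(mtcars[, c("disp", "hp", "wt", "qsec", "drat")])
y <- as.numeric(scale(mtcars$mpg))

# PCA direction: depends on X only.
w_pca <- prcomp(X)$rotation[, 1]

# First PLS direction: proportional to X'y, so y enters the decomposition.
w_pls <- as.numeric(crossprod(X, y))
w_pls <- w_pls / sqrt(sum(w_pls^2))

t_pca <- X %*% w_pca   # scores on the first principal component
t_pls <- X %*% w_pls   # scores on the first PLS component

# Among all unit-length directions, w_pls maximises cov(X w, y).
print(c(pca = abs(cov(t_pca, y)), pls = abs(cov(t_pls, y))))
```

The PLS scores therefore always covary with the target at least as strongly as the PCA scores do, which is why fewer PLS components are often sufficient for prediction.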

library(pls)
#fit PLSR model
modelpls <- plsr(Happiness  ~., ncomp = 11, data = scaled_data_factors, validation="CV")



# RMSEP score for the first model
plot(RMSEP(modelpls), legendpos = "topright")

#make the analysis reproducible
set.seed(200)

#fit PLSR model
model <- plsr(Happiness  ~., ncomp = 2, data = scaled_data_factors, validation="CV")

# Biplot of the scores only
biplot(model, comps = 1:2, which = "scores", cex = 0.6, main = "")

# Biplot of the X scores and X loadings (default which = "x")
biplot(model, comps = 1:2, asp = 1, cex = 0.6, main = "", col = c(happiness_category, "black"))

# check the summary
summary(model)
## Data:    X dimension: 95 13 
##  Y dimension: 95 1
## Fit method: kernelpls
## Number of components considered: 2
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps
## CV           1.005   0.5505   0.5263
## adjCV        1.005   0.5493   0.5239
## 
## TRAINING: % variance explained
##            1 comps  2 comps
## X            38.85    52.58
## Happiness    72.13    77.18

SOM

SOM Fanplot (2015)

Mappings for SOM

What further influences happiness?

#box <- ggplot(data_2018, aes(x = Region, y = Happiness, color = Region), ) +
#  geom_boxplot() + 
#  geom_jitter(aes(color=Country), size = 0.5) +
#  ggtitle("Happiness Score for Regions and Countries") + 
#  coord_flip() + 
#  theme(legend.position="none")
#ggplotly(box)

Tobacco Consumption

## Warning: Removed 157 rows containing non-finite values (stat_smooth).

How does happiness change over time?

Animation


Geography map (color each country based on the percentage change over time, 2015-2022)


Future work